Unsupervised Word Segmentation Without Dictionary

Authors

  • Jason S. Chang
  • Tracy Lin
Abstract

This prototype system demonstrates a novel method of word segmentation based on corpus statistics. Because the central technique is unsupervised training on a large corpus, we refer to the approach as unsupervised word segmentation. The approach is general in scope and can be applied to both Mandarin Chinese and Taiwanese. In this prototype, we illustrate its use in segmenting a Taiwanese Bible written in Hanzi and Romanized characters. The method involves the following steps:

  • Computing the mutual information (MI) between Hanzi and Romanized characters A and B. If A and B have a relatively high MI, we lean toward treating AB as a word.
  • Using a greedy method to form words of 2 to 4 characters in the input sentences.
  • Building an N-gram model from the results of the first-round word segmentation.
  • Segmenting words based on the N-gram model.
  • Iterating between the two preceding steps: building the N-gram model and segmenting words.

Computing mutual information. The use of mutual information is motivated by previous work by Church and Hanks (1990) and Sproat and Shih (1990). If A and B have an MI above a certain threshold, we prefer to identify AB as a word over sequences with lower MI values. In the experiment with the Taiwanese Bible, the system identified Hanzi characters and Romanized syllables. From those, we obtained pairs of consecutive single or double Hanzi characters and Romanized syllables. Such pairs are commonly known as character bigrams, trigrams, and fourgrams. We departed from the usual N-gram calculation and treated them as pairs of character sequences in order to apply mutual information statistics. Table 1 shows some examples of the pairs and their MI values. We have excluded pairs with an MI of 2.2 or lower.
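The first two steps (MI scoring and greedy word formation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the MI formula, so the sketch assumes the standard pointwise form log2(N * c(AB) / (c(A) * c(B))), where N is the number of units in the corpus. The function names, the left-to-right scanning order, and the toy corpus are all hypothetical; only the 2.2 cutoff and the 1-or-2-unit pairing come from the text above.

```python
import math
from collections import Counter

def count_sequences(sentences, max_len=2):
    """Count runs of 1..max_len units and adjacent (A, B) pairs of such runs."""
    unit_counts, pair_counts, total = Counter(), Counter(), 0
    for sent in sentences:
        n = len(sent)
        total += n
        for i in range(n):
            for la in range(1, max_len + 1):
                if i + la > n:
                    break
                a = tuple(sent[i:i + la])
                unit_counts[a] += 1
                for lb in range(1, max_len + 1):
                    if i + la + lb > n:
                        break
                    b = tuple(sent[i + la:i + la + lb])
                    pair_counts[(a, b)] += 1
    return unit_counts, pair_counts, total

def mutual_information(a, b, unit_counts, pair_counts, total):
    """Pointwise MI in bits (assumed formula); -inf for unseen pairs."""
    joint = pair_counts.get((a, b), 0)
    if joint == 0:
        return float("-inf")
    return math.log2(joint * total / (unit_counts[a] * unit_counts[b]))

def greedy_segment(sent, unit_counts, pair_counts, total, threshold=2.2):
    """Scan left to right; at each position take the best-scoring A+B split
    (A and B each 1 or 2 units, so words span 2 to 4 units) whose MI is
    strictly above the threshold, otherwise emit a single unit."""
    words, i, n = [], 0, len(sent)
    while i < n:
        best_len, best_mi = 1, threshold
        for la in (1, 2):
            for lb in (1, 2):
                if i + la + lb > n:
                    continue
                a = tuple(sent[i:i + la])
                b = tuple(sent[i + la:i + la + lb])
                mi = mutual_information(a, b, unit_counts, pair_counts, total)
                if mi > best_mi:
                    best_mi, best_len = mi, la + lb
        words.append(tuple(sent[i:i + best_len]))
        i += best_len
    return words

# Hypothetical toy corpus: each sentence is a list of characters or syllables.
corpus = [
    ["a", "b", "c"], ["a", "b", "d"], ["a", "b", "e"],
    ["c", "d", "e"], ["d", "e", "c"], ["e", "c", "d"],
]
units, pairs, total = count_sequences(corpus)
print(greedy_segment(["c", "a", "b"], units, pairs, total))
```

In the toy corpus, "a" and "b" always occur together, so their MI is log2(6), about 2.58, which clears the 2.2 cutoff. Note that with raw counts, rare sequences can receive inflated MI scores; a real system would likely need a minimum-frequency filter, which this sketch omits. The later steps (building an N-gram model and iterating) are not shown.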

Similar resources

Bayesian Unsupervised Word Segmentation with Hierarchical Language Modeling

This paper proposes a novel unsupervised morphological analyzer for arbitrary languages that needs neither supervised segmentation nor a dictionary. Assuming a string is the output of a nonparametric Bayesian hierarchical n-gram language model of words and characters, “words” are iteratively estimated during inference by a combination of MCMC and efficient dynamic programming. This model c...

Unsupervised Statistical Segmentation of Japanese Kanji Strings (Report TR99-1756)

Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character n-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of ...

Unsupervised and Semi-supervised Myanmar Word Segmentation Approaches for Statistical Machine Translation

In statistical machine translation (SMT), word segmentation is generally a necessary step for languages that do not naturally delimit words. For many low-resource languages there are no word segmentation tools, and research on word segmentation for these languages is often quite scarce. In this paper, we study several plausible methods for Myanmar word segmentation for machine translation in or...

How does Dictionary Size Influence Performance of Vietnamese Word Segmentation?

Vietnamese word segmentation (VWS) is a challenging basic issue for natural language processing. This paper addresses how dictionary size influences VWS performance, proposes two novel measures, square overlap ratio (SOR) and relaxed square overlap ratio (RSOR), and validates their effectiveness. The SOR measure is the product of dictionary overlap ratio and corpus overlap ra...

Experiments on Unsupervised Chinese Word Segmentation and Classification

Chinese language processing faces several problems because Chinese is written without word delimiters, and the difficulty of defining a word makes them even harder. This paper explores the possibility of automatically segmenting Chinese character sequences into words and classifying these words through distributional analysis, in contrast with the usual approaches that depend on dictio...

Publication date: 2003